Description of the dataset

Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

  1. Title: Wine Quality

  2. Sources

    Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

  3. Past Usage:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

  4. Relevant Information:

    The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

  5. Number of Instances: red wine - 1599; white wine - 4898.

  6. Number of Attributes: 11 + output attribute

    Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

  7. Attribute information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
    1 - fixed acidity (tartaric acid - g / dm^3)
    2 - volatile acidity (acetic acid - g / dm^3)
    3 - citric acid (g / dm^3)
    4 - residual sugar (g / dm^3)
    5 - chlorides (sodium chloride - g / dm^3)
    6 - free sulfur dioxide (mg / dm^3)
    7 - total sulfur dioxide (mg / dm^3)
    8 - density (g / cm^3)
    9 - pH
    10 - sulphates (potassium sulphate - g / dm^3)
    11 - alcohol (% by volume)

    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

  8. Missing Attribute Values: None

  9. Description of attributes:

    1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

    2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

    3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

    4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

    5 - chlorides: the amount of salt in the wine

    6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

    7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

    8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

    9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

    10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

    11 - alcohol: the percent alcohol content of the wine

    12 - quality (score between 0 and 10)

## 'data.frame':    1599 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.bucket      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ total.acidity       : num  8.1 8.68 8.6 12.04 8.1 ...

There are 1599 wines evaluated in the dataset. Evaluation is based on 12 variables (11 continuous and 1 discrete). Two extra variables are added to the dataset:
- quality.bucket = categorical variable for quality
- total.acidity = fixed.acidity + volatile.acidity + citric.acid
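
The two extra variables could be derived along the following lines. This is only a sketch: the data frame `wines` is a hypothetical four-row subset typed in from the `str()` output above, and the use of `factor()` with levels 3 to 8 is an assumption consistent with the 6 levels shown there, not necessarily the code used for this report.

```r
# First four wines, copied from the str() output above (illustrative subset only).
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56),
  quality          = c(5L, 5L, 5L, 6L)
)

# Assumption: one factor level per score observed in the red wine data (3 to 8).
wines$quality.bucket <- factor(wines$quality, levels = 3:8)

# Sum of the three acidity measurements, all expressed in g / dm^3.
wines$total.acidity <- with(wines, fixed.acidity + volatile.acidity + citric.acid)

print(wines[, c("quality", "quality.bucket", "total.acidity")])
```

The resulting total.acidity values can be compared with the first entries of the `total.acidity` row in the `str()` output.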

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Quality scores of the red wines are approximately normally distributed, ranging from 3 (poor) to 8 (excellent).
The median score is 6 and the mean is slightly lower at 5.6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity ranges from 4.6 g / dm^3 to 15.9 g / dm^3.
The distribution is positively skewed (median 7.9 g / dm^3, mean 8.32 g / dm^3).
Fixed acidity seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity ranges from 0.12 g / dm^3 to 1.58 g / dm^3.
The distribution is close to normal (median 0.52 g / dm^3, mean 0.5278 g / dm^3).
There is one outlier with value 1.58 g / dm^3.
Wines with a good rating (7 or 8) tend to have less volatile acidity than the average wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid ranges from 0 g / dm^3 to 1 g / dm^3.
The distribution is positively skewed (median 0.26 g / dm^3, mean 0.271 g / dm^3).
The distribution has several modes at 0 g / dm^3, 0.25 g / dm^3 and 0.5 g / dm^3.
There is one outlier with value 1 g / dm^3.
Wines with a good rating (7 or 8) tend to have more citric acid than the average wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.270   7.827   8.720   9.118  10.070  17.045

Total acidity ranges from 5.27 g / dm^3 to 17.045 g / dm^3.
The distribution is positively skewed (median 8.72 g / dm^3, mean 9.118 g / dm^3).
There are some outliers with values above 15 g / dm^3.
Total acidity seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar ranges from 0.9 g / dm^3 to 15.5 g / dm^3.
There are no sweet wines in the dataset (residual sugar above 45 g / dm^3).
The bulk of the distribution is close to normal (median 2.2 g / dm^3, mean 2.539 g / dm^3), with a long right tail.
There are some outliers with values above 4 g / dm^3.
Residual sugar seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides range from 0.012 g / dm^3 to 0.611 g / dm^3.
The distribution is close to normal (median 0.079 g / dm^3, mean 0.08747 g / dm^3).
There are a few outliers with values above 0.3 g / dm^3.
Chlorides seem to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide ranges from 1 mg / dm^3 to 72 mg / dm^3.
The distribution is positively skewed (median 14 mg / dm^3, mean 15.87 mg / dm^3).
Free sulfur dioxide seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulfur dioxide ranges from 6 mg / dm^3 to 289 mg / dm^3.
The distribution is positively skewed (median 38 mg / dm^3, mean 46.47 mg / dm^3).
There are a few outliers with values above 200 mg / dm^3.
Total sulfur dioxide seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density ranges from 0.9901 g / cm^3 to 1.0037 g / cm^3.
The distribution is close to normal (median 0.9968 g / cm^3, mean 0.9967 g / cm^3).
Density seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH ranges from 2.74 to 4.01.
The distribution is close to normal (median 3.31, mean 3.311).
pH seems to have a weak influence on wine rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates range from 0.33 g / dm^3 to 2 g / dm^3.
The distribution is positively skewed (median 0.62 g / dm^3, mean 0.6581 g / dm^3).
There are a few outliers with values above 1.5 g / dm^3.
Wines with a good rating (7 or 8) tend to have slightly more sulphates than the average wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol ranges from 8.4 % to 14.9 %.
The distribution is positively skewed (median 10.2 %, mean 10.42 %).
There are a few outliers with values above 14 %.
Wines with a good rating (7 or 8) tend to have more alcohol than the average wine.

Univariate Analysis

Structure of the dataset

There are 1599 wines evaluated in the dataset.

The description of each wine is based on 12 variables:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
12 - quality (score between 0 and 10)

Main feature of interest in the dataset

The main feature of interest in the dataset is the quality of the wine. It is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Other feature linked to the main feature of interest

Based on the univariate analysis of each variable faceted by wine quality, the following features have been identified as potentially correlated with wine quality:
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

New variable

The total.acidity variable (sum of all acidity items) is added to the dataset.

Correction for skewed distribution

Some variables have a positively skewed distribution across the population. I applied a logarithmic transformation to bring these distributions closer to normal.
This applies to the following variables:
1 - fixed acidity (tartaric acid - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
10 - sulphates (potassium sulphate - g / dm^3)
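
The effect of the log transformation can be illustrated on a handful of values. The sketch below uses only the ten total sulfur dioxide values visible in the `str()` output above, so it is not representative of the full dataset; `skewness()` is a hypothetical helper defined here (moment-based sample skewness), not a base R function.

```r
# Ten total sulfur dioxide values (mg / dm^3) copied from the str() output above.
total.so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew.raw <- skewness(total.so2)
skew.log <- skewness(log10(total.so2))

cat("skewness of raw values:  ", skew.raw, "\n")
cat("skewness of log10 values:", skew.log, "\n")
```

On this small sample the raw values are positively skewed and the log10 values are closer to symmetric, which is the behaviour the transformation is meant to achieve.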

Bivariate Plots Section

Pairplots

The matrix plot helps to better understand the correlations between the variables of the dataset.
It confirms that quality is mainly correlated with alcohol, sulphates, citric acid (positively) and volatile acidity (negatively).
Several links between supporting variables are also identified:
- pH and acidity (which is quite obvious)
- density with alcohol, acidity and residual sugar

Other bivariate plots

## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$quality and wineQualityReds$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The bivariate plot of quality vs alcohol confirms our intuition and shows a clear correlation between both variables.
The Pearson correlation coefficient between alcohol and quality is 0.48.
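
As a quick consistency check on the output above, the t statistic of a Pearson correlation test can be recomputed from the correlation coefficient and the degrees of freedom with t = r * sqrt(df) / sqrt(1 - r^2). The sketch below only reuses the two numbers printed by `cor.test()`; the variable names are illustrative.

```r
r  <- 0.4761663   # sample estimate reported by cor.test() above
df <- 1597        # degrees of freedom: n - 2 with n = 1599 wines

# t statistic of the test H0: true correlation equals 0.
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the Student t distribution.
p.value <- 2 * pt(-abs(t.stat), df)

cat("t =", round(t.stat, 3), " p-value =", p.value, "\n")
```

The recomputed statistic agrees with the reported t = 21.639 to within rounding, and the p-value is far below the 2.2e-16 threshold printed by R.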

## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$quality and log10(wineQualityReds$sulphates)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636092 0.3523323
## sample estimates:
##       cor 
## 0.3086419

The bivariate plot of quality vs log10(sulphates) also shows a correlation.
The Pearson correlation coefficient between log10(sulphates) and quality is 0.31.

## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$quality and wineQualityReds$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The bivariate plot of quality vs citric.acid shows a weak correlation.
The Pearson correlation coefficient between citric.acid and quality is 0.23.

## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$quality and wineQualityReds$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The bivariate plot of quality vs volatile.acidity shows a negative correlation.
The Pearson correlation coefficient between volatile.acidity and quality is -0.39.

## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$density and wineQualityReds$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798
## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$density and wineQualityReds$fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473
## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$density and log10(wineQualityReds$residual.sugar)
## t = 18.363, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3762175 0.4572012
## sample estimates:
##       cor 
## 0.4175381
## 
##  Pearson's product-moment correlation
## 
## data:  wineQualityReds$density and wineQualityReds$pH
## t = -14.53, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3842835 -0.2976642
## sample estimates:
##        cor 
## -0.3416993

There are clear correlations between wine density and alcohol (-0.5), fixed acidity (0.67), residual sugar (0.42 with log10(residual.sugar)) and pH (-0.34).
The strongest correlation is with fixed acidity, which was not obvious from a physical standpoint (the correlation with alcohol was easier to infer).

Bivariate Analysis

Relationship between wine quality and other features from the dataset

The main correlations with wine quality involve the following variables:
- alcohol -> positive correlation: the more alcohol the merrier
- sulphates -> positive correlation which could not be suspected beforehand
- citric acid -> positive correlation, in line with the description in the dataset: “found in small quantities, citric acid can add ‘freshness’ and flavor to wines”
- volatile acidity -> negative correlation, in line with the description in the dataset: “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”

With respect to the main feature of interest (wine quality), the strongest relationship is with alcohol (Pearson correlation coefficient of 0.48).

Relationship between wine density and other features from the dataset

There are strong correlations between the density of the wine and several other variables: alcohol, acidity and residual sugar.

Multivariate Plots Section

The plot of alcohol vs sulphates colored by wine quality indicates that wines with higher levels of alcohol and sulphates are usually of better quality (the shade of red goes from the lower left corner to the upper right corner of the plot).

The plot of alcohol vs citric acid colored by wine quality indicates that wines with higher levels of alcohol and citric acid are usually of better quality (the shade of red goes from the lower left corner to the upper right corner of the plot).

The plot of alcohol vs volatile acidity colored by wine quality indicates that wines with higher levels of alcohol and lower levels of volatile acidity are usually of better quality (the shade of red goes from the upper left corner to the lower right corner of the plot).

Multivariate Analysis

Relationship between wine quality and other features from the dataset

Multivariate plots of alcohol vs sulphates, citric acid and volatile acidity colored by wine quality confirm the correlations found in the bivariate analysis, and suggest that they reinforce each other: wine quality improves when these features move together in the favourable direction.

Given that I have no preconceived idea about wine chemical features, I did not find surprising interactions between features.

Model Elaboration

Wine quality linear model

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wineQualityReds)
## m2: lm(formula = quality ~ alcohol + I(log10(sulphates)), data = wineQualityReds)
## m3: lm(formula = quality ~ alcohol + I(log10(sulphates)) + citric.acid, 
##     data = wineQualityReds)
## m4: lm(formula = quality ~ alcohol + I(log10(sulphates)) + citric.acid + 
##     volatile.acidity, data = wineQualityReds)
## 
## ===============================================================================
##                             m1            m2            m3            m4       
## -------------------------------------------------------------------------------
##   (Intercept)              1.875***      2.541***      2.421***      3.444***  
##                           (0.175)       (0.177)       (0.178)       (0.196)    
##   alcohol                  0.361***      0.335***      0.330***      0.303***  
##                           (0.017)       (0.016)       (0.016)       (0.016)    
##   I(log10(sulphates))                    2.070***      1.781***      1.518***  
##                                         (0.177)       (0.186)       (0.181)    
##   citric.acid                                          0.446***     -0.113     
##                                                       (0.092)       (0.103)    
##   volatile.acidity                                                  -1.217***  
##                                                                     (0.112)    
## -------------------------------------------------------------------------------
##   R-squared                0.227         0.288         0.298         0.346     
##   adj. R-squared           0.226         0.287         0.296         0.344     
##   sigma                    0.710         0.682         0.677         0.654     
##   F                      468.267       322.031       225.436       210.808     
##   p                        0.000         0.000         0.000         0.000     
##   Log-likelihood       -1721.057     -1655.601     -1644.025     -1587.153     
##   Deviance               805.870       742.522       731.849       681.597     
##   AIC                   3448.114      3319.202      3298.051      3186.306     
##   BIC                   3464.245      3340.711      3324.937      3218.569     
##   N                     1599          1599          1599          1599         
## ===============================================================================

Linear models for wine quality are built with increasing complexity (from one to four features). Adding features improved the fit of the model (R2 increased and sigma decreased as features were added).
The linear model for wine quality based on the 4 variables which were found to be most correlated with wine quality (alcohol, sulphates, citric acid and volatile acidity) still has poor results in terms of R2 (R2 = 0.346 while a good regression should be close to 1). This is probably due to the fact that the linear model gives continuous results while the wine quality is discrete (grades from 3 to 8). In fact, modeling the wine quality is more a classification task than a regression one.
Results in terms of residual standard error are more satisfactory (sigma = 0.654). Assuming roughly normal residuals, about 95% of the model predictions are within two sigma, i.e. about 1.3 points, of the actual wine quality. This model should be able to help non wine experts in their wine selection.
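
To make the model concrete, the m4 coefficients from the table above can be applied by hand to a single wine. This is a rough sketch: the coefficients are the rounded values printed in the table, `predict_quality()` is a hypothetical helper written for this illustration, and the example wine is the first row of the `str()` output (alcohol 9.4, sulphates 0.56, citric acid 0, volatile acidity 0.70, observed quality 5).

```r
# Rounded m4 coefficients copied from the regression table above.
coefs <- c(intercept        = 3.444,
           alcohol          = 0.303,
           log10.sulphates  = 1.518,
           citric.acid      = -0.113,
           volatile.acidity = -1.217)
sigma <- 0.654  # residual standard error of m4

# Hypothetical helper applying the m4 formula to one wine.
predict_quality <- function(alcohol, sulphates, citric.acid, volatile.acidity) {
  coefs[["intercept"]] +
    coefs[["alcohol"]]          * alcohol +
    coefs[["log10.sulphates"]]  * log10(sulphates) +
    coefs[["citric.acid"]]      * citric.acid +
    coefs[["volatile.acidity"]] * volatile.acidity
}

# First wine of the dataset (observed quality: 5).
pred <- predict_quality(alcohol = 9.4, sulphates = 0.56,
                        citric.acid = 0, volatile.acidity = 0.70)

# Approximate 95% band: prediction plus or minus two sigma.
band <- pred + c(-2, 2) * sigma

cat("predicted quality:", round(pred, 2), "\n")
cat("approx. 95% band: ", round(band, 2), "\n")
```

The prediction lands close to the observed score of 5, but the two-sigma band spans several grades, which illustrates why the model is only a coarse guide.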

Wine density linear model

## 
## Calls:
## m1: lm(formula = density ~ alcohol, data = wineQualityReds)
## m2: lm(formula = density ~ alcohol + fixed.acidity, data = wineQualityReds)
## m3: lm(formula = density ~ alcohol + fixed.acidity + I(log10(residual.sugar)), 
##     data = wineQualityReds)
## m4: lm(formula = density ~ alcohol + fixed.acidity + I(log10(residual.sugar)) + 
##     pH, data = wineQualityReds)
## 
## ========================================================================================
##                                  m1             m2             m3             m4        
## ----------------------------------------------------------------------------------------
##   (Intercept)                    1.006***       0.999***       0.999***       0.983***  
##                                 (0.000)        (0.000)        (0.000)        (0.001)    
##   alcohol                       -0.001***      -0.001***      -0.001***      -0.001***  
##                                 (0.000)        (0.000)        (0.000)        (0.000)    
##   fixed.acidity                                 0.001***       0.001***       0.001***  
##                                                (0.000)        (0.000)        (0.000)    
##   I(log10(residual.sugar))                                     0.004***       0.004***  
##                                                               (0.000)        (0.000)    
##   pH                                                                          0.004***  
##                                                                              (0.000)    
## ----------------------------------------------------------------------------------------
##   R-squared                      0.246          0.654          0.776          0.843     
##   adj. R-squared                 0.246          0.654          0.776          0.843     
##   sigma                          0.002          0.001          0.001          0.001     
##   F                            521.583       1508.935       1843.511       2146.396     
##   p                              0.000          0.000          0.000          0.000     
##   Log-likelihood              7987.444       8610.211       8958.189       9243.872     
##   Deviance                       0.004          0.002          0.001          0.001     
##   AIC                       -15968.888     -17212.423     -17906.379     -18475.744     
##   BIC                       -15952.757     -17190.914     -17879.493     -18443.481     
##   N                           1599           1599           1599           1599         
## ========================================================================================

Linear models for wine density are built with increasing complexity (from one to four features). Adding features improved the fit of the model (R2 increased and sigma decreased as features were added).
The linear model for wine density based on the 4 variables which were found to be most correlated with wine density (alcohol, fixed acidity, residual sugar and pH) has satisfactory results in terms of R2 and sigma (R2 = 0.843 and sigma = 0.001).
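
Since R2 can only grow when features are added, the adjusted R2 rows of the two tables are a useful cross-check. The sketch below recomputes them from the rounded R2 values with the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1); `adj_r2()` is a hypothetical helper, and n = 1599 and p = 4 predictors are taken from the m4 columns above.

```r
# Hypothetical helper: adjusted R-squared for n observations and p predictors.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1599  # number of red wines
p <- 4     # predictors in the m4 models

adj.quality <- adj_r2(0.346, n, p)  # quality model m4
adj.density <- adj_r2(0.843, n, p)  # density model m4

cat("adjusted R2, quality m4:", round(adj.quality, 3), "\n")
cat("adjusted R2, density m4:", round(adj.density, 3), "\n")
```

With 1599 observations and only 4 predictors the penalty is tiny, so the adjusted values stay within rounding of the reported ones and the gain in R2 is not an artifact of the extra features.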


Final Plots and Summary

Wine quality histogram

Quality scores of the red wines are approximately normally distributed, ranging from 3 (poor) to 8 (excellent).
The median score is 6 and the mean is slightly lower at 5.6.

Wine density vs alcohol, fixed acidity, residual sugar and pH

There are clear correlations between wine density and alcohol (-0.5), fixed acidity (0.67), residual sugar (0.42 with log10(residual.sugar)) and pH (-0.34).
The strongest correlation is with fixed acidity, which was not obvious from a physical standpoint (the correlation with alcohol was easier to infer).
It is possible to build a linear model for wine density based on the four features plotted above. Outputs of the model are satisfactory with R2 = 0.843 and sigma = 0.001.

Wine quality function of alcohol and sulphates

The plot of alcohol vs sulphates colored by wine quality indicates that wines with higher levels of alcohol and sulphates are usually of better quality (the shade of red goes from the lower left corner to the upper right corner of the plot).
It is possible to build a linear model for wine quality based on the four main features correlated with wine quality: alcohol, sulphates, citric acid and volatile acidity.
Outputs of the model are not so good in terms of R2 (R2 = 0.346), probably because wine quality is a discrete variable. Nevertheless, the residual standard error of the model is acceptable (sigma = 0.654).


Reflection

I used the Exploratory Data Analysis technique with R to investigate a completely unknown dataset. I’m no expert in wine and the variables gathered in the dataset were quite specific, resulting from a chemical analysis (for the 11 features) or a wine tasting (wine quality). At first, I did not know how I would extract any information from this bunch of numbers. Then, proceeding step by step (univariate analysis, bivariate analysis, multivariate analysis), I managed to get a better idea of how wine quality as perceived by some wine experts was correlated with some chemical features. I hope the plots I produced reflect well my understanding of the underlying links between wine quality and the main wine chemical features.
At first, I was a bit disappointed by the results of the linear model I produced for wine quality. Nevertheless, I’m convinced it is rather a classification task and it is only normal that a regression tool gives poor results. I would be interested in using a classification machine learning algorithm on this dataset.